Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify boundaries of numeric env vars #4331

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jack-berg
Copy link
Member

@jack-berg jack-berg commented Dec 9, 2024

Resolves #4283.

Adds new guidance indicates that for timeout variables, "implementations SHOULD validate that values are positive". Carves out an exception for implementations which have different semantics and feel changing could not be justified as a bugfix:

unless they have good reasons not to (e.g. backwards compatibility with semantics where a negative or zero value meansindefinite)

The original issue was about the semantics of assigning a timeout of 0 or negative values, but figured it was useful to do a pass on all numeric env vars clarify acceptable boundaries.

List of changes below. All properties currently have unspecified boundaries.

  • Jaeger remote sampler pollingIntervalMs: values are positive, since polling with an interval of 0ms is nonsensical.
  • OTEL_BSP_SCHEDULE_DELAY / OTEL_BLRP_SCHEDULE_DELAY / OTEL_METRIC_EXPORT_INTERVAL: values are non-negative, since its valid to immediately start an export 0ms after the prior ended.
  • OTEL_BSP_EXPORT_TIMEOUT / OTEL_BLRP_EXPORT_TIMEOUT / OTEL_METRIC_EXPORT_TIMEOUT: values are positive, following new guidance for timeouts.
  • OTEL_BSP_MAX_QUEUE_SIZE / OTEL_BLRP_MAX_QUEUE_SIZE: values are positive, since setting the queue size to zero is nonsensical.
  • OTEL_BSP_MAX_EXPORT_BATCH_SIZE / OTEL_BLRP_MAX_EXPORT_BATCH_SIZE: values are positive, since setting the batch size to zero is nonsensical.
  • OTEL_ATTRIBUTE_* / OTEL_SPAN_ATTRIBUTE_* / OTEL_LOGRECORD_ATTRIBUTE_*: values are non-negative, since its valid to record 0 attributes.
  • OTEL_EXPORTER_ZIPKIN_TIMEOUT / OTEL_EXPORTER_OTLP_TIMEOUT / OTEL_EXPORTER_OTLP_{SIGNAL}_TIMEOUT: values are positive, following new guidance for timeouts.

Related PR to opentelemetry-configuration: open-telemetry/opentelemetry-configuration#151

@jack-berg jack-berg requested review from a team as code owners December 9, 2024 23:57
@@ -108,6 +109,13 @@ gracefully ignore the setting and use the default value if it is defined.

For example, the value `12000` indicates 12000 milliseconds, i.e., 12 seconds.

### Timeout values

For variables that represent a timeout (e.g. exporter timeout), implementations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is the "Duration" section right above the "Timeout values" section, it has mentioned things such as "an integer representing a number of milliseconds" and "if a negative value is provided, the implementation MUST generate a warning". Are "Duration" and "Timeout values" mutually exclusive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A timeout is a type of duration. So maybe I should move this content to the duration section, or as a subsection within duration.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A timeout is a type of duration. So maybe I should move this content to the duration section, or as a subsection within duration.

Yep, that'll bring clarity, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A timeout is a type of duration.

Am I correct that _DELAY are the only configurations that of type duration but are not of type timeout? Do we want to have zero delay?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to have zero delay?

I came to the conclusion that while its odd, its still valid-ish. Take the batch span processor's delay interval, which represents the gap in time from the conclusion of one export to the start of the next. Assume that exports over the network take some amount of time like 2ms. Setting the batch span processor's delay to zero is just saying: "as soon as one export resolves, start the next".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on supporting zero delay, folks from the operating system background would find it common https://learn.microsoft.com/windows/win32/api/synchapi/nf-synchapi-sleepex#parameters.


For variables that represent a timeout (e.g. exporter timeout), implementations
SHOULD validate that values are positive unless they have good reasons not to (
e.g. backwards compatibility with semantics where a negative or zero value means
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will lead to inconsistency and confusion, let's say if Java SDK decided to treat -1 as indefinite while Python SDK decided to fall back to defaults (e.g. 30,000 milliseconds).

I think it is better to ask all SDKs to have a consistent behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to ask all SDKs to have a consistent behavior.

Consistency is best, but its difficult to make consistency recommendations post hoc. Consider our change in advice to make http/protobuf default OTLP protocol, when it was previously grpc. We have portions of the ecosystem which made the switch, either because they went stable after the change in advice or decided the change wasn't impactful enough to be breaking for users, and other portions which stuck with grpc as the default (like opentelemetry-java).

If all maintainers agree that changing their semantics is a bugfix, let's get rid of this qualification.

Copy link
Member

@reyang reyang Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the grpc and http/protobuf example. However, I think for timeouts there is a high chance that we can achieve consistency. For example, if 3 language implementation SIGs are saying "-1 means indefinite to us" while other SIGs treat negative values as invalid, it is possible to have all implementations treating -1 as indefinite.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, if 3 language implementation SIGs are saying "-1 means indefinite to us" while other SIGs treat negative values as invalid, it is possible to have all implementations treating -1 as indefinite.

Its true. Its always acceptable to loosen the restriction.

Copy link
Member

@pellared pellared Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should define it similarly like in other types e.g.

The value is positive - if a negative value is provided, the implementation MUST generate a warning, gracefully ignore the setting and use the default value if it is defined; if a zero value is provided, the implementation SHOULD (not "MUST" because of backwards compatibility when it is used to represent infinity or zero) generate a warning, gracefully ignore the setting and use the default value if it is defined.

let's say if Java SDK decided to treat -1 as indefinite

This should be seen as a bug given the definition of the duration type:

The value is non-negative - if a negative value is provided, the implementation MUST generate a warning, gracefully ignore the setting and use the default value if it is defined.

The only issue is when 0 is handled as Infinite or zero. See:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To summarize the arguments:

  • If negative, generate a warning and use the default. This is already the specified behavior for duration properties, which includes timeout.
  • If zero, interpret to mean indefinite. There is prior art to support this interpretation. Also, its less controversial for an implementation which currently rejects 0 as invalid to begin accepting it than it is for an implementation which currently accepts zero to begin rejecting it. I.e. loosening restrictiveness is always ok. Increasing restrictiveness may be interpreted as a breaking change.

Given these arguments, does anyone feel strongly that we should reject 0 as invalid? Tagging folks who originally supported this idea: @tigrannajaryan, @lmolkova, @cijothomas, @pellared

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If negative, generate a warning and use the default. This is already the specified behavior for duration properties, which includes timeout.

👍

If zero, interpret to mean indefinite.

Looks good for Go. However, what about .NET? CC @cijothomas @reyang

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reyang
Copy link
Member

reyang commented Dec 10, 2024

Need a changelog entry.

carlosalberto added a commit that referenced this pull request Dec 21, 2024
Related to #4283.

A
[comment](#4331 (comment))
adding a "type" column to each env var, but didn't feel appropriate to
extend scope of #4331.

---------

Co-authored-by: Carlos Alberto Cortez <[email protected]>
Co-authored-by: Reiley Yang <[email protected]>
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Dec 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clarify that exporter timeout settings must be positive
5 participants